ABSTRACT

Novel challenges must be met when designing UNIX implementations for
highly parallel shared-memory MIMD architectures. Of primary importance
is  the  need  to  avoid  serial  bottlenecks  whenever  possible,  so  that  the 
potential  speedup  of  such  machines  can  be  realized.  Critical  code  sections 
far  too  short  or  infrequent  to  seriously  impact  performance  on  today's 
machines  will  be  of  concern  on  very  large  machines  because  the  cost  of  each 
serial  section  rises  linearly  with  the  number  of  processors  involved.  In 
addition,  the  kernel  interface  must  provide  for  a  structured  and  natural  style 
of  general-purpose  parallel  programming.  We  present  the  approaches  taken 
to  satisfy  these  requirements  for  machines  such  as  the  NYU  Ultracomputer 
and  the  IBM  RP3.  We  also  describe  our  preliminary  parallel  implementation 
of  UNIX,  which  is  currently  running  on  an  eight-processor  prototype 
Ultracomputer. 

1.   Introduction 

Continuing  advances  in  microelectronics  have  inspired  many  to  consider 
assembling  large  numbers  of  powerful  processors  into  a  single  general-purpose 
machine  capable  of  solving  very  large  problems.  Two  such  projects  currently 
underway  are  the  NYU  Ultracomputer  (Gottlieb  et  al.  [83b])  and  the  IBM  RP3 
(Pfister  et  al.  [85],  Pfister  and  Norton  [85],  Norton  and  Pfister  [85],  Brantley  et  al. 
[85]), the former a shared-memory design and the latter supporting both shared and
private memory. At NYU we have been investigating the adaptation of the UNIX
operating system to these architectures.

However,  it  remains  to  be  demonstrated  that  such  machines  can  be  effectively 
utilized, and that UNIX is well-suited to the needs of such an environment. There
are  two  aspects  to  this  challenge.  First,  several  thousand  processors  must  be 
coordinated  in  such  a  way  that  their  aggregate  power  is  applied  to  useful 
computation.  Serial  code  sections  in  which  one  processor  works  while  the  others 
wait  become  bottlenecks  that  drastically  reduce  the  power  obtained,  even  if  the 
serial  section  is  so  small  or  infrequently  executed  as  to  be  entirely  acceptable  on  a 
machine  with  only  modest  parallelism.  Indeed,  the  relative  cost  of  a  serial 
bottleneck  rises  linearly  with  the  number  of  processors  involved.  Second,  the 
machine  must  be  programmable  by  humans.  Effective  use  of  high  degrees  of 
parallelism  will  be  facilitated  by  simple  languages  and  facilities  for  designing, 
writing,  and  debugging  parallel  programs. 

The  present  report  concentrates  on  operating  system  considerations  for  shared 
memory  parallel  processors.  The  software  ramifications  of  the  RP3  private 
memory  are  only  briefly  discussed;  a  more  detailed  presentation  will  appear  in  a 
future  paper.  We  begin  with  an  overview  of  the  computational  model  and 
architecture  of  the  NYU  Ultracomputer.  Next,  we  consider  issues  in  parallel 
programming  that  impact  the  operating  system,  in  particular  the  system  services 
required  by  parallel  code,  and  process  and  job  scheduling  issues.  The  following 
section  describes  in  somewhat  more  complete  fashion  the  programming  interface 
presented  by  the  kernel  for  parallel  programs.  Finally,  we  discuss  some  design 
issues  of  the  ultraparallel  operating  system  kernel  itself,  including  data  structures, 
resource  management,  and  synchronization  mechanisms.  These  mechanisms  have 
been employed in the implementation of a parallel UNIX system running on an
eight-processor  prototype  Ultracomputer. 

2.   Machine  Architecture 

In  this  section  we  review  the  architectural  model  on  which  the  Ultracomputer  is 
based,  and  illustrate  the  power  of  this  model.  Although  this  idealized  machine  is 
not  physically  realizable,  a  close  approximation  can  be  built.  Elements  of  the  actual 
machine  design  are  briefly  described  in  order  to  illustrate  integrated 
hardware/software  mechanisms  for  bottleneck-free  coordination  of  a  very  large 
number  of  processors.  The  reader  is  referred  to  Gottlieb  et  al.  [83b],  Edler  et  al. 
[85],  and  the  references  therein  for  further  details. 

2.1.   Paracomputers 

An  idealized  parallel  processor,  dubbed  a  "paracomputer"  by  Schwartz  [80] 
and classified as a WRAM by Borodin and Hopcroft [82], consists of a number of
autonomous  processing  elements  (PEs)  sharing  a  central  memory  that  they  are 
permitted  to  read  or  write  in  a  single  cycle.  In  particular,  simultaneous  reads  and 
writes directed at the same memory cell are effected in one cycle.

We augment the paracomputer model with the "fetch-and-add" (F&A)
operation, a powerful interprocessor coordination primitive that permits highly
concurrent execution of operating system algorithms and application programs (see
Gottlieb and Kruskal [81]). Fetch-and-add is essentially an indivisible add to
memory; its format is F&A(V,e), where V is an integer variable and e is an integer
expression. The operation returns the (old) value of V and replaces V by the sum
V+e. Moreover, concurrent fetch-and-adds are required to have the same effect as
if  executed  in  some  (unspecified)  serial  order.  The  following  example  illustrates  the 
semantics  of  fetch-and-add:  Consider  several  PEs  concurrently  executing  F&A(I,1), 
where  I  is  a  shared  variable  used  to  index  into  a  shared  array.  Each  PE  obtains  an 
index  to  a  distinct  array  element  (although  one  cannot  predict  which  element  will  be 
assigned  to  which  PE),  and  I  receives  the  appropriate  total  increment. 
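
The semantics just described can be sketched in Python as follows. This is only an illustration under stated assumptions: the names SharedInt and claim_indices are hypothetical, and the lock merely simulates the indivisibility that the machine would provide in hardware.

```python
import threading


class SharedInt:
    """Integer cell with an indivisible fetch-and-add.

    Assumption: the lock only simulates hardware atomicity; it is not
    how the Ultracomputer realizes F&A.
    """

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, e):
        # Return the old value of V and replace V by V + e, indivisibly.
        with self._lock:
            old = self._value
            self._value = old + e
            return old

    def load(self):
        with self._lock:
            return self._value


def claim_indices(num_pes):
    """Each simulated PE executes F&A(I, 1) once and records the index it got."""
    I = SharedInt(0)
    claimed = [None] * num_pes

    def pe(pe_id):
        claimed[pe_id] = I.fetch_and_add(1)

    threads = [threading.Thread(target=pe, args=(p,)) for p in range(num_pes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed, I.load()
```

Which PE receives which index varies from run to run, but the set of indices is always distinct and I ends up incremented by the number of PEs.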

Fetch-and-add is a special case of the more general fetch-and-Φ operation
(where Φ may be an arbitrary binary associative operator) introduced by Gottlieb
and Kruskal. The classic test-and-set and compare-and-swap synchronization
operations are both special cases of fetch-and-Φ as well, and the familiar load and
store  operations  are  degenerate  cases. 

2.2.   The  Power  of  Fetch-and-Add 

Using  the  fetch-and-add  operation  we  can  perform  many  important  algorithms 
in  a  completely  parallel  manner,  i.e.  without  using  any  critical  sections.  For 
example,  as  indicated  above,  concurrent  executions  of  F&A(I,1)  yield  consecutive 
values  that  may  be  used  to  index  an  array.  If  this  array  is  interpreted  as  a 
(sequentially  stored)  queue,  the  values  returned  may  be  used  to  perform  concurrent 
inserts; analogously F&A(D,1) may be used for concurrent deletes. The complete
queue  algorithms  contain  checks  for  overflow  and  underflow,  collisions  between 
insert  and  delete  pointers,  etc.  (see  Gottlieb  et  al.  [83a]).  Forthcoming  sections  will 
indicate  how  such  techniques  can  be  used  to  implement  a  totally  decentralized 
operating  system  scheduler.  We  are  unaware  of  any  other  completely  parallel 
solution  to  this  problem  and  note  that  given  a  queue  that  is  neither  empty  nor  full, 
the  concurrent  execution  of  thousands  of  inserts  and  thousands  of  deletes  can  all  be 
accomplished  in  the  time  required  for  just  one  such  operation. 
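
A deliberately incomplete sketch of the slot-claiming idea is given below. It is not the queue algorithm of Gottlieb et al. [83a]: the checks for overflow, underflow, and pointer collisions are omitted, and the sketch avoids needing them by running all inserts to completion before any delete begins. The names SharedInt, run_in_parallel, and fill_then_drain are hypothetical, and the lock only simulates an indivisible F&A.

```python
import threading


class SharedInt:
    """Integer cell whose fetch_and_add is made indivisible by a lock (simulation only)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, e):
        with self._lock:
            old = self._value
            self._value = old + e
            return old


def run_in_parallel(target, arg_list):
    """Start one thread per argument tuple and wait for all of them."""
    threads = [threading.Thread(target=target, args=args) for args in arg_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def fill_then_drain(items, workers=4):
    """Insert all items concurrently, then delete them all concurrently."""
    size = len(items)
    slots = [None] * size              # the sequentially stored queue
    I = SharedInt(0)                   # insert pointer
    D = SharedInt(0)                   # delete pointer
    shares = [items[w::workers] for w in range(workers)]
    outputs = [[] for _ in range(workers)]

    def inserter(share):
        for x in share:
            # F&A(I, 1) hands every insert a distinct slot index.
            slots[I.fetch_and_add(1) % size] = x

    def deleter(count, out):
        for _ in range(count):
            # F&A(D, 1) hands every delete a distinct slot index.
            out.append(slots[D.fetch_and_add(1) % size])

    run_in_parallel(inserter, [(s,) for s in shares])
    run_in_parallel(deleter, [(len(s), o) for s, o in zip(shares, outputs)])
    return [x for out in outputs for x in out]
```

Every item inserted is deleted exactly once, although the order in which individual workers see the items is not predictable.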

As  another  example,  consider  the  classical  readers-writers  problem  in  which 
two  classes  of  processes  are  to  share  access  to  a  central  data  structure.  One  class, 
the  readers,  are  permitted  to  execute  concurrently,  whereas  the  writers  require 
exclusive  access.  Although  there  are  many  solutions  to  this  problem,  only  the 
fetch-and-add based solution given by Gottlieb et al. [83a] has the crucial property
that during periods of no writer activity, no critical sections are executed^1.


^1 Most other solutions require readers to execute small critical sections to check if a writer is
active and indivisibly announce their own presence. The "eventcount" mechanism of Reed and
Kanodia [79], although completely parallel, detects rather than prevents the simultaneous activity of
readers and writers.
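
One way such a protocol can be organized is sketched below. It is written in the spirit of the fetch-and-add approach and may differ in detail from the published algorithm of Gottlieb et al. [83a]. A shared counter is raised by 1 for each reader and by K for a writer, where K is an assumed upper bound on the number of simultaneous readers. The helper tir (test, increment, retest) and all other names are hypothetical, the lock inside SharedInt only simulates an indivisible F&A, and this simple version does not protect writers from starvation.

```python
import threading
import time


class SharedInt:
    """Integer cell with a lock-simulated indivisible fetch-and-add."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, e):
        with self._lock:
            old = self._value
            self._value = old + e
            return old

    def load(self):
        with self._lock:
            return self._value


K = 64  # assumed bound on simultaneous readers


def tir(c, delta, bound):
    """Add delta to c only if the result stays within bound."""
    if c.load() + delta <= bound:
        if c.fetch_and_add(delta) + delta <= bound:
            return True
        c.fetch_and_add(-delta)        # overshoot: undo the increment
    return False


def run_readers_writers(num_readers=4, num_writers=2, rounds=25):
    c = SharedInt(0)
    data = {"a": 0, "b": 0}            # writers keep a == b
    torn = []                          # inconsistent snapshots seen by readers

    def reader():
        for _ in range(rounds):
            while not tir(c, 1, K):    # fails only while a writer holds K
                time.sleep(0)
            a = data["a"]
            time.sleep(0)
            b = data["b"]
            if a != b:
                torn.append((a, b))
            c.fetch_and_add(-1)

    def writer():
        for _ in range(rounds):
            while not tir(c, K, K):    # succeeds only when the counter was 0
                time.sleep(0)
            data["a"] += 1
            time.sleep(0)
            data["b"] += 1
            c.fetch_and_add(-K)

    threads = [threading.Thread(target=reader) for _ in range(num_readers)]
    threads += [threading.Thread(target=writer) for _ in range(num_writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return data, torn, c.load()
```

When no writer is active, a reader performs only a load and two fetch-and-adds on the counter and never waits on another reader.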

2.3.   Hardware Realization

The  paracomputer  is  not  physically  realizable,  due  to  fan-in  (and  other) 
limitations.  Furthermore,  if  concurrent  fetch-and-add  or  load  operations  were  to  be 
serialized  at  the  memory  of  a  real  parallel  computer,  then  we  would  lose  the 
advantage  of  parallel  coordination  algorithms,  having  merely  moved  the  critical 
sections  from  the  software  to  the  hardware. 

In  fact,  a  parallel  processor  closely  approximating  our  idealized  paracomputer 
can  be  built  as  described  in  Gottlieb  et  al.  [83b].  The  resulting  NYU  Ultracomputer 
uses a message switching network with the topology of Lawrie's [75] Ω-network
(equivalently, the SW Banyan of Goke and Lipovski [73]) to connect N autonomous
PEs (N a power of two) to a central shared memory composed of N memory modules
(MMs).  Thus,  the  paracomputer's  single  cycle  access  to  shared  memory  is 
approximated  by  a  multicycle  connection  network. 

When  concurrent  loads,  stores,  and  fetch-and-adds  are  directed  at  the  same 
memory  location  and  meet  at  a  switch,  they  can  be  combined  without  introducing 
any delay (see Klappholz [80], and Gottlieb et al. [83a]). Since combined requests
can  themselves  be  combined,  any  number  of  concurrent  memory  references  to  the 
same  location  can  be  satisfied  in  the  time  required  for  one  central  memory  access. 
It  is  this  property  that  permits  the  bottleneck-free  implementation  of  many 
coordination  protocols. 
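
The arithmetic behind combining can be sketched as follows, using the rule usually given for the Ultracomputer switch: two requests F&A(X,e) and F&A(X,f) are forwarded as the single request F&A(X,e+f) while the switch remembers e, and when the reply y returns the switch answers y to the first request and y+e to the second. The sketch models only this arithmetic, not the switch hardware, queues, or timing; the function names and the pairwise splitting of the request list are assumptions made for illustration.

```python
def distribute(increments, y):
    """De-combine a reply: compute each original request's reply from y.

    The list is split in two the way a switch would see two incoming
    requests; the left half's total plays the role of the saved value e.
    """
    if len(increments) == 1:
        return [y]
    mid = len(increments) // 2
    left, right = increments[:mid], increments[mid:]
    # Left requests see y; right requests see y plus everything on the left.
    return distribute(left, y) + distribute(right, y + sum(left))


def combined_fetch_and_add(memory, addr, increments):
    """Satisfy several F&A requests on one cell with one memory access."""
    total = sum(increments)            # the fully combined request
    y = memory[addr]                   # single access at the memory module
    memory[addr] = y + total
    return distribute(increments, y)
```

For a cell holding 10 and increments 1, 2, 3, 4, the replies are 10, 11, 13, 16 and the cell ends at 20, which is exactly the effect of executing the four operations in that serial order.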

The  impact  of  network  latency  on  performance  is  reduced  by  associating  a  local 
cache  memory  with  each  PE.  Frequently-used  program  code  and  data  can  be 
accessed  in  (approximately)  a  single  processor  cycle  when  resident  in  the  cache. 
Thus  the  network  latency  is  eliminated  from  many  memory  accesses,  and  all 
accesses  benefit  from  the  reduced  network  traffic. 

Unfortunately, caching of read-write shared variables presents a coherence
problem  among  the  various  caches.  Different  caches  will  in  general  come  to  contain 
different  values  for  the  same  variable,  and  updates  of  the  corresponding  memory 
cell  will  occur  out  of  sequence.  Means  are  needed  to  ensure  synchronized,  coherent 
access  among  the  multiple  PEs  for  variables  used  for  coordination  or  data 
transmission. An obvious mechanism, which is employed in the prototype
Ultracomputer, is merely to prevent caching of read-write shared variables^2. User
level  software  is  responsible  for  arranging  the  correct  "cacheability"  status  of  each 
memory  access. 

Because  the  cache  hit  ratio  is  a  major  factor  in  machine  performance, 
maximizing  cacheability  is  an  important  function  of  the  compiler  and  operating 
system  software.    This  suggests  supporting  a  more  elaborate  scheme  in  which 


^2 Because of the stochastic nature of memory access through the Ω-network, this may not be
sufficient to ensure that the synchronization is maintained. If the processor or cache is capable of
issuing  memory  requests  and  proceeding  without  waiting  for  acknowledgment  from  the  network, 
then  for  code  sensitive  to  synchronization  a  further  mechanism  (a  "FENCE"  operation)  is  needed  to 
guarantee that updates are sequenced correctly (Collier [81]).

shared variables that are accessed read-only, or accessed only privately during a
particular  code  segment,  may  be  cacheable  during  execution  of  that  segment  (see 
McAuliffe  [86]). 

3.   Parallel  Programming 

3.1.   Levels  of  Parallel  Control 

We  consider  applications  programming  of  Ultracomputer-like  parallel 
computers  primarily  from  the  standpoint  of  operating  system  design.  That  is,  we 
study  the  requirements  placed  on  the  operating  system  by  the  desire  to  support 
effectively the number of different styles of programming permitted by the shared-memory
MIMD model. While we see great potential in automatic parallelization of
sequential programs, we assume here for purpose of discourse only that programs
are designed originally for a shared-memory MIMD computer^3.

Much work on concurrent programming has focused on a high-level, block-structured
paradigm of parallel process control. Under this model parallel code is
structured  in  closed-form  constructs  with  implicit  synchronization  at  the  end  of  each 
parallel  block;  explicit  synchronization  and  "fork/join"  operations  (Conway  [63], 
Dennis  and  Van  Horn  [66])  are  discouraged  or  disallowed.  The  argument  is  that 
the  resulting  parallel  code  is  simpler,  clearer,  and  easier  to  debug  than  code  with 
unrestrained  and  unstructured  process  creation/destruction  and  synchronization. 
Furthermore,  this  programming  style  obviates  in  many  cases  the  need  for  explicit 
shared/private  declarations.  That  distinction  is  instead  implicit  in  the  usual  static 
scope  rules.  Thus,  for  example,  a  variable  that  is  visible  within  a  block  defining  an 
"iteration"  of  a  parallel  loop,  but  declared  in  a  larger  enclosing  scope,  is  taken  to  be 
shared  during  execution  of  the  parallel  loop;  a  variable  declared  within  the  block 
itself  has  scope  of  an  individual  iteration  and  hence  is  private.  Finally,  automatic 
optimization  of  parallel  code  is  facilitated  when  the  parallel  structure  of  a  program 
can  be  readily  detected  by  the  compiler.  It  should  be  possible,  for  example,  for  an 
optimizing  compiler  to  implement  a  fine  granularity  of  control  over  the  cacheability 
of  data  areas,  with  greater  reliability  (and  thus  avoidance  of  cache  coherence  errors) 
than  could  be  specified  by  the  programmer. 

However,  the  utility  and  effectiveness  of  this  high-level  parallel  programming 
style is not yet demonstrated. Furthermore, the volatile parallelism facilitated by
these  closed-form  parallel  constructs  will  not  be  needed  in  all  programs.  When 
parallel  processes  need  to  synchronize  very  frequently,  the  overhead  involved  in  the 
process  creation  and  termination  operations  invoked  by  these  constructs  will  become 
significant.  This  overhead  can  be  alleviated  to  some  extent  by  pre-spawning 
processes  as  described  below.  However,  many  parallel  applications  will  be  more 
suited  to  a  lower-level  programming  style.    One  such  approach  involves  the  initial 


^3 In particular, the automatic techniques of Kuck (see Kuck and Padua [79]) and Kennedy (see
Kennedy [80], Allen and Kennedy [84]) will be important for running both existing and new
applications.

creation of a fixed number of long-lived processes, usually smaller than the number
of PEs. Synchronization, scheduling, and memory management become entirely the
responsibility of the application programmer. Here the syntactic structure of the
program  provides  no  information  regarding  the  dynamic  parallelism  or  the  sharing 
of  data. 

As  we  will  see  in  a  subsequent  section,  an  attractive  synthesis  of  the  two  styles 
may  be  obtainable.  We  consider  implementing  these  scheduling  and  management 
functions in user-mode code that is part of the runtime environment provided by a
language  compiler.  In  the  ideal  case  this  would  afford  the  advantages  of  high-level 
structured  parallel  programs,  while  the  creation,  destruction,  synchronization,  and 
management  of  parallel  threads  of  control  are  accomplished  without  operating 
system  overhead.   Ramifications  of  these  programming  models  are  discussed  below. 

3.2.  Parallel  Constructs 

Note  that  programming  in  the  MIMD  parallel  environment  need  not  be 
radically different from conventional sequential programming. We consider parallel
languages  which  are  variants  of  conventional  procedural  languages,  augmented  only 
with  a  shared/private  attribute  for  declared  variables  and  a  small  number  of  explicit 
parallel  control  constructs:  The  parallel  loop,  in  which  the  iterates  are  executed  in 
parallel  instead  of  serially,  is  an  obvious  parallel  extension  of  the  loop  construct 
found in every procedural language (see Gosden [66], Droughon et al. [67],
Lundstrom  and  Barnes  [80],  and  Davies  [81]).  The  parallel  compound  statement,  or 
parallel  block,  contains  inhomogeneous  statements  that  are  to  be  executed  in 
parallel. It is also a popular structure and has appeared many times (e.g., Dijkstra
[68],  Brinch-Hansen  [73],  and  the  collateral  expression  of  ALGOL  68).  Neither 
construct  contains  explicit  reference  to  processors.  The  PEs  themselves  are  a 
resource  that  is  only  indirectly  available  to  the  program.  These  (and  other)  parallel 
constructs  generate  processes  which  run  on  PEs.  Potentially,  all  of  the  processes 
generated  by  a  parallel  construct  could  execute  simultaneously.  Whether  or  not  this 
actually  occurs  depends  on  the  scheduling  policy,  system  load,  and  other  factors. 

3.2.1.   Barrier Synchronization

While it is important to minimize the idle
processor time caused by synchronization among parallel processes, such
synchronization is occasionally necessary. A common form of synchronization
occurs when each of the processes executing a parallel code section must wait at a
"barrier", or synchronization point, until they have all reached that point.
Algorithms for barrier synchronization have been given by Rudolph [82] and
Kruskal [81]. Other important types of synchronization are discussed subsequently
in the context of the parallel kernel design (Section 6.3).
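
A minimal busy-waiting barrier built on fetch-and-add is sketched below. It is not the algorithm of Rudolph [82] or Kruskal [81]; it is a generic reusable counter-and-phase barrier, and the names SharedInt, FetchAddBarrier, and run_rounds are hypothetical. The lock inside SharedInt only simulates an indivisible F&A.

```python
import threading
import time


class SharedInt:
    """Integer cell with a lock-simulated indivisible fetch-and-add."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, e):
        with self._lock:
            old = self._value
            self._value = old + e
            return old

    def load(self):
        with self._lock:
            return self._value


class FetchAddBarrier:
    """Reusable barrier for exactly n processes."""

    def __init__(self, n):
        self.n = n
        self.arrived = SharedInt(0)
        self.phase = 0                 # advanced by the last arriver

    def wait(self):
        my_phase = self.phase
        if self.arrived.fetch_and_add(1) == self.n - 1:
            # Last arriver: reset the count first, then release the others.
            self.arrived.fetch_and_add(-self.n)
            self.phase = my_phase + 1
        else:
            while self.phase == my_phase:
                time.sleep(0)          # busy-wait, yielding the interpreter


def run_rounds(n=4, rounds=20):
    """Return the rounds in which some worker passed the barrier too early."""
    barrier = FetchAddBarrier(n)
    finished = [SharedInt(0) for _ in range(rounds)]
    early = []

    def worker():
        for r in range(rounds):
            finished[r].fetch_and_add(1)   # the "work" of round r
            barrier.wait()
            if finished[r].load() != n:
                early.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return early
```

The count is reset before the phase is advanced, so that a process released from one barrier episode can safely arrive at the next.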

3.3.  Implementation  of  Parallel  Constructs 

Having  discussed  the  benefits  of  a  "high-level"  programming  style,  we  now 
consider  in  detail  whether  that  style  can  be  supported  effectively.  In  particular, 
potential parallelism must not be sacrificed due to the parallel language
implementation. We shall see that the granularity of parallelizable processes is a
crucial barometer of the effectiveness of the implementation.

The  basic  mechanism  provided  for  creating  parallelism  is  the  spawn  operation, 
which is used to support the parallel loop and parallel block constructs. Spawn is
fundamentally  an  n-way  fork  of  control,  in  which  n  identical  subprocesses  are 
created.  The  subprocesses  are  made  available  for  scheduling  on  any  available  PEs, 
in  a  manner  to  be  discussed.  We  assume  here  that  the  "parent"  process  waits  for 
the  termination  of  its  spawned  "children",  which  occurs  automatically  at  the  end  of 
the parallel code block. Hence the parent process, and thus the subsequent
program statements, are synchronized with the completion of the spawned parallel
processes. 

The  translation  into  runtime  code  of  a  parallel  loop,  with  n  homogeneous 
iterates,  might  make  use  of  the  following  operating  system  primitives.  First,  a 
spawn(n)  is  executed  by  the  original  (parent)  process.  It  stores  the  value  n  in  a 
shared  children  variable,  adds  n  items  to  the  system  work  queue,  and  executes  a 
form of wait so as to block until resumed subsequently by a terminating child
process. The terminate function is executed by the spawned child processes at the
end of the loop body. Terminate executes F&A(children, -1); if this drives children
to zero then the current process is the last terminating child and thus resumes the
parent  process. 
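
The spawn and terminate protocol can be sketched in Python as follows. Threads stand in for the child processes and an event stands in for the parent's blocked wait; these substitutions, the class SharedInt, and the signature spawn(n, body) are assumptions made for illustration and are not the kernel interface. Since F&A returns the old value, the child that drives children to zero is the one that sees 1.

```python
import threading


class SharedInt:
    """Integer cell with a lock-simulated indivisible fetch-and-add."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, e):
        with self._lock:
            old = self._value
            self._value = old + e
            return old


def spawn(n, body):
    """Run body(0) .. body(n-1) as children; return once all have terminated."""
    if n == 0:
        return
    children = SharedInt(n)                # the shared "children" variable
    parent_resumed = threading.Event()     # stands in for the parent's wait

    def terminate():
        # Old value 1 means this F&A drove children to zero: last child.
        if children.fetch_and_add(-1) == 1:
            parent_resumed.set()

    def child(i):
        body(i)
        terminate()

    for i in range(n):
        threading.Thread(target=child, args=(i,)).start()
    parent_resumed.wait()                  # parent blocks until resumed
```

A parallel loop of n iterates then reduces to a single call such as spawn(n, loop_body), with the statements after the loop executed by the parent once it is resumed.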

An  obvious  implementation  for  the  central  work  pool  is  the  fetch-and-add 
based  parallel-access  queue  that  was  mentioned  earlier. 

3.4.   Performance  Issues  in  Parallel  Control 

There  are  two  significant  performance  criteria  by  which  any  implementation  of 
these  operating  system  primitives  must  be  evaluated.  First,  we  wish  to  avoid 
algorithms  that  require  time  linear  (or  worse)  in  the  number  of  spawned  processes. 
Hence  it  is  unacceptable  to  implement  a  spawn  of  n  processes  by  a  sequence  of  n 
insert  operations  on  a  system  process  queue.  Instead,  a  spawn  operation  inserts  a 
single  item  with  multiplicity  n  on  the  process  queue,  and  the  PEs  deleting  such  an 
item  complete  the  creation  of  the  children  in  parallel.  Together  with  fully  parallel 
memory  allocation  routines,  this  will  largely  prevent  the  occurrence  of  diminishing 
marginal  benefits  as  the  degree  of  parallelism  is  increased. 

However, the overhead (in absolute terms) of these operations is also crucial,
because  it  determines  the  minimum  granularity  of  parallel  operations  that  can  be 
efficiently  spawned.  Thus,  although  the  basic  unit  of  parallelism  provided  by  the 
language  constructs  is  the  program  statement,  it  is  clear  that  spawning  processes 
that  each  execute  a  trivial  statement  (e.g.,  one  assignment  or  arithmetic  operation) 
would  be  inefficient.  While  careful  design  of  portions  of  the  operating  system  can 
reduce  this  overhead,  the  facilities  described  to  this  point  must  be  considered  useful 
only  for  relatively  large-granularity  parallel  operations.  This  limitation  can  be 
alleviated in several ways.

Parallel loop iterates of smaller granularity can often be supported with a policy
known as chunking. When the number of parallel iterates (n) is much larger than
the number of available PEs (N), or the number of instructions in the body of the
loop is small compared to the overhead of process creation, then the loop can be
transformed into a serial loop consisting of k iterations nested inside a parallel loop
of n/k iterations. Thus the scheduling overhead per iterate is reduced by a factor of
k invisibly to the programmer, and, if k < n/N, the effective parallelism is not
diminished.  The  value  of  k  is  most  appropriately  determined  at  runtime  as  a 
function  of  the  number  of  PEs  available  and  possibly  of  system  load.  Care  must  be 
taken to avoid setting k too high and nullifying the load-balancing properties of the
"self-service"  paradigm  discussed  below.  Kruskal  and  Weiss  [84]  have  examined 
the  performance  of  such  chunking  policies  and  argue  that  even  very  naive  schemes 
can  perform  acceptably  well. 
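
The chunking transformation can be sketched as follows. The rule used to pick k (leave a few chunks per PE so that load balancing survives) is one naive possibility chosen for illustration, not a policy taken from Kruskal and Weiss [84]; the parameter chunks_per_pe and both function names are hypothetical, and a thread pool stands in for the PEs.

```python
from concurrent.futures import ThreadPoolExecutor


def choose_chunk(n, num_pes, chunks_per_pe=4):
    """Pick k at runtime so that roughly chunks_per_pe chunks exist per PE."""
    return max(1, n // (num_pes * chunks_per_pe))


def chunked_parallel_for(n, body, num_pes, k=None):
    """Run body(i) for i in range(n) as about n/k parallel chunks of k serial iterates."""
    if k is None:
        k = choose_chunk(n, num_pes)

    def run_chunk(start):
        # The inner serial loop of (at most) k iterations.
        for i in range(start, min(start + k, n)):
            body(i)

    # The outer parallel loop: one task per chunk instead of one per iterate.
    with ThreadPoolExecutor(max_workers=num_pes) as pool:
        list(pool.map(run_chunk, range(0, n, k)))
    return k
```

With n = 1000 and 8 PEs this rule gives k = 31, so about 33 tasks are scheduled rather than 1000, while k remains well below n/N = 125.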

Effective  granularity  of  programmed  parallel  functions  can  be  further  reduced 
by  pre-spawning  processes.  Here  the  programmer  or  compiler  spawns  and  suspends 
a  sufficient  number  of  processes  in  advance  of  any  parallel  constructs.  The  parallel 
loop  or  block  code  then  activates  these  processes  by  sending  a  message  or  a  signal, 
thus  "creating"  parallel  threads  of  control  without  the  overhead  of  process  creation. 
In  effect  we  are  supporting  the  parallel  constructs  with  operating  system  functions 
moved into user-mode code.

Finally,  one  can  pre-spawn  non-preemptable  processes  that  busy-wait  until 
needed  by  parallel  constructs.  Far  finer  granularities  of  parallel  functions  can  thus 
be  programmed  with  high-level  constructs,  since  virtually  all  of  the  overhead  of 
process  creation,  scheduling,  and  context  switching  is  eliminated.  The  cost  is  in 
tying  up  a  larger  number  of  PEs  than  may  actually  be  used  during  much  of  the 
program  execution,  particularly  if  the  degree  of  parallelism  is  highly  volatile. 

3.5.   Performance  Issues  in  Barrier  Synchronization 

Synchronization  operations  may  be  implemented  in  two  ways:  It  is  always 
correct  for  the  processor  to  suspend  the  current  process  while  waiting  for  the 
synchronization  to  complete,  and  switch  to  another  process.  However,  considerable 
operating  system  overhead  is  incurred.  The  alternative  is  a  busy-waiting 
synchronization  routine  that  simply  loops  and  tests  whether  the  synchronization 
condition  is  satisfied.  For  short  waits,  the  overhead  involved  is  significantly  less  for 
busy-waiting  than  for  process  switching.  However,  busy-waiting  admits  the 
possibility  of  deadlock  when  the  number  of  synchronizing  processes  is  greater  than 
the  number  of  available  processors  (and  no  preemption  is  permitted).  Even  when 
deadlock  cannot  occur  it  is  possible  that  a  number  of  processors  will  loop 
unproductively  for  extended  periods  of  time,  particularly  if  one  or  more  of  the 
synchronizing  processes  is  preempted  or  swapped  out. 

A  hybrid  implementation,  in  which  each  process  busy-waits  for  a  short  period 
and  then,  if  the  condition  is  still  not  satisfied,  yields  the  PE,  would  avoid 
rescheduling overhead in those cases where busy-waiting is suitable, and deadlock
would be prevented. Synchronization issues have implications for process and job
scheduling  that  are  addressed  below. 
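
The hybrid policy can be sketched as follows, with a threading.Event standing in for the synchronization condition and its blocking wait standing in for yielding the PE. The function name hybrid_wait and the parameter spin_limit are hypothetical, and a suitable spin limit would have to be tuned against the actual cost of a process switch.

```python
import threading


def hybrid_wait(condition_met, spin_limit=1000):
    """Busy-wait briefly, then give up the processor if still not satisfied.

    Returns "spun" if the condition became true while spinning and
    "blocked" if the caller had to suspend.
    """
    for _ in range(spin_limit):
        if condition_met.is_set():
            return "spun"              # short wait: no rescheduling overhead
    condition_met.wait()               # long wait: suspend until signalled
    return "blocked"
```

Because the waiter eventually suspends, a process that holds the awaited condition can always be scheduled, which is what removes the deadlock risk of pure busy-waiting.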

4.   Process Scheduling

4.1.   The  "Self-Service"  Paradigm 

Operating  systems  for  uniprocessors  and  some  multiprocessors  usually  contain 
a  single  module  that  schedules  use  of  the  processor(s)  by  assigning  processes  as 
appropriate.  While  this  centralized  approach  assures  favorable  load  balancing,  it 
introduces  a  serial  bottleneck  that  will  limit  overall  performance  on  highly  parallel 
machines.  Alleviating  this  bottleneck  by  designating  two  or  more  schedulers,  each 
managing  a  portion  of  the  processors,  leads  to  an  interesting  tradeoff:  if  the  number 
of  schedulers  is  small,  scheduling  bottlenecks  arise;  if  the  number  is  large,  effective 
load  balancing  suffers. 

Another  technique  used  to  avoid  the  bottleneck  is  stochastic  distributed 
scheduling. Here a work queue is maintained for each PE. Whenever a process is
created on any PE, that PE assigns the process on a random basis to one of the
work queues. In concert with time-slicing, or when certain constraints (on the
variance of process execution times) hold, this mechanism can effectively balance
the load, possibly at a cost in scheduling overhead (see Klappholz [82]). It does not
permit the constant-time spawning of multiple processes discussed earlier since the
assignment  of  each  process  must  be  randomized  individually. 
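
The following sketch of random assignment makes the last point concrete: spawning n processes requires n separate random choices, so the work grows with n. The function name and its parameters are hypothetical, and the sketch ignores time-slicing and everything else a real scheduler would do.

```python
import random


def stochastic_assign(num_processes, num_pes, rng):
    """Assign each new process to a randomly chosen per-PE work queue."""
    queues = [[] for _ in range(num_pes)]
    for pid in range(num_processes):
        # One random choice per process: n spawned processes cost O(n) work.
        queues[rng.randrange(num_pes)].append(pid)
    return queues
```

Passing a seeded generator such as random.Random(0) makes a run repeatable; how evenly the queues fill depends on the draw.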

The  scheduling  paradigm  adopted  for  process  scheduling  (and  other  resource 
management  functions)  in  the  Ultracomputer  is  that  of  a  self-service  system,  in 
which  one  maintains  a  single  central  queue  of  ready  processes.  Each  processor 
accesses  this  shared  queue  to  obtain  processes  for  execution  and  to  insert  newly- 
spawned  processes.  This  self-service  paradigm  relies  on  simultaneous  distributed 
processing  of  centralized  data,  and  is  highly  dependent  on  concurrently-accessible 
data  structures  that  allow  concurrent  operations  to  be  performed  without 
serialization.  The  fetch-and-add  based  mechanisms  described  earlier  are  crucial  in 
implementing  these  critical-section-free  operations,  e.g.,  queue  insertion  and 
deletion. 

The  queue  algorithms  used  are  enhanced  variants  of  the  algorithms  mentioned 
earlier that support three important additional features:

(1)  Multiplicity. A spawn of k processes is implemented by the insertion of an item
of  multiplicity  k,  which  is  "deleted"  k  times  before  actually  being  removed 
from  the  queue. 

(2)  Priorities.  Processes  may  be  inserted  onto  the  central  ready  queue  at  any  one 
of  a  (fixed)  number  of  priorities,  with  the  delete  operation  removing  the 
highest priority item^4.


^4 In addition to other more obvious functions, process priorities are needed in management of
nested spawns. The dynamic process structure of a program in which spawns are nested several

(3)  Interior removal. In order to swap out a process from memory, or to
prematurely  terminate  a  process,  it  is  occasionally  necessary  to  "delete"  an  item 
from  the  middle  of  a  queue  (see  Wilson  [86]). 
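
The multiplicity feature can be illustrated in isolation with the sketch below, in which a single item of multiplicity k is "deleted" by several PEs at once. This is not the enhanced queue algorithm itself (priorities, interior removal, and the surrounding queue are all omitted); the class MultiItem, its take method, and the simulate helper are hypothetical, and the lock inside SharedInt only simulates an indivisible F&A.

```python
import threading


class SharedInt:
    """Integer cell with a lock-simulated indivisible fetch-and-add."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, e):
        with self._lock:
            old = self._value
            self._value = old + e
            return old


class MultiItem:
    """A queue item standing for k identical processes yet to be created."""

    def __init__(self, template, k):
        self.template = template
        self.k = k
        self.remaining = SharedInt(k)

    def take(self):
        """'Delete' one instance.

        Returns (child_index, is_last), or None if the item is exhausted.
        """
        old = self.remaining.fetch_and_add(-1)
        if old <= 0:
            self.remaining.fetch_and_add(1)    # nothing left: undo
            return None
        # The PE that sees old == 1 took the last instance and would
        # actually remove the item from the queue.
        return self.k - old, old == 1


def simulate(k, num_pes):
    """Let num_pes simulated PEs each try to take one instance of one item."""
    item = MultiItem("child", k)
    results = [None] * num_pes

    def pe(p):
        results[p] = item.take()

    threads = [threading.Thread(target=pe, args=(p,)) for p in range(num_pes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The spawning process performs one insertion regardless of k, and each PE that takes an instance learns a distinct child index from which it can complete the creation of that child.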

Distributed  scheduling  from  a  central  ready  queue  achieves  optimal  load 
balancing  among  the  PEs  and  also  facilitates  multiprogramming.  Unrelated  jobs 
may  contribute  processes  to  the  ready  queue;  all  will  be  scheduled  as  PEs  become 
available.  In  addition  to  its  usual  benefits,  multiprogramming  can  improve 
throughput  by  allowing  serial  sections  and  highly  parallel  sections  of  different  jobs 
to  be  overlapped. 

4.2.  Job  Scheduling 

As  indicated  above,  when  the  number  of  ready  processes  exceeds  the  number 
of  processors,  the  operating  system  uses  a  priority  based  preemptive  scheduling 
algorithm.  Although  we  expect  this  standard  multiprogramming  discipline  to  prove 
adequate  for  program  development  as  well  as  for  production  runs  of  many 
programs,  we  anticipate  that  some  sophisticated  users  solving  large  problems  will 
require  finer  control  over  scheduling.  Such  users  are  well  served  by  the  non- 
preemptable  allocation  of  a  number  of  processors,  which  are  then  assigned  to 
subtasks  under  program  control.  In  addition  to  permitting  a  problem-specific 
dynamic  choice  of  which  subtask  to  execute  next  (rather  than  relying  on  the 
operating  system's  unsophisticated  notion  of  priority),  non-preemptable  processes 
are  immune  from  the  overhead  associated  with  involuntary  context  switching.  The 
operating  system  supports  non-preemptable  processes  with  a  variant  of  the  spawn 
primitive,  which  inserts  a  non-preemptable  item  with  multiplicity  n  onto  a  high 
priority  ready  queue,  causing  the  next  n  available  PEs  to  select  these  processes  for 
execution.  Since  the  operating  system  will  not  permit  the  total  number  of  such 
processes  in  the  system  to  exceed  the  number  of  PEs,  the  time  delay  between  the 
invocation  of  the  first  and  last  instance  of  the  process  in  question  is  bounded  by  the 
preemption  interval. 

To  illustrate  one  use  of  this  facility  consider  an  application  that  requires  tight 
synchronization  between  processes,  i.e.,  one  in  which  processes  must  synchronize
after  executing  only  a  small  number  of  instructions.  Although  the  stochastic  nature 
of  the  network  prevents  an  exact  determination  of  the  rate  of  progress  for  an 
individual  process,  with  preemption  removed  (and  with  the  Ultracomputer 
combining  network),  a  group  of  processes  will  execute  at  roughly  equal  rates  with 
high  probability.    Thus,  providing  that  the  program  segments  executed  by  each 


levels  deep  may  be  depicted  as  a  tree,  in  which  spawned  child  processes  are  represented  by  nodes 
that  are  descendants  of  their  parent  process  node.  If  the  program  is  executed  such  that  the  process 
tree  is  traversed  breadth-first,  then  there  is  a  danger  that  because  the  processes  at  each  level  are 
spawning  more  processes  before  any  process  may  terminate,  the  capacity  of  memory  or  system  tables 
may  be  overwhelmed  by  an  exponential  explosion  in  instantiated  processes.  This  is  avoided  by 
ensuring  depth-first  traversal  of  the  process  tree,  which  may  be  achieved  by  enqueuing  the  template 
process  between  corresponding  synchronization  points  are  of  comparable  length,  a
programmer  using  the  non-preemptable  spawn  can  profitably  synchronize  the 
resulting  processes  by  means  of  busy  waiting  loops  free  of  system  calls. 
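As an illustration of such system-call-free busy waiting, the following is a minimal sketch of a barrier built on fetch-and-add. It is not taken from the prototype system: the names FetchAndAdd and SpinBarrier are hypothetical, Python threads stand in for non-preemptable processes, and a lock emulates the hardware fetch-and-add, which on the Ultracomputer would be combined in the network rather than serialized.

```python
import threading


class FetchAndAdd:
    """Software stand-in for a hardware fetch-and-add cell (assumption)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # emulation only

    def fetch_and_add(self, delta):
        with self._lock:
            old = self._value
            self._value += delta
            return old

    def load(self):
        with self._lock:
            return self._value


class SpinBarrier:
    """Reusable barrier for n participants; waiting is a pure spin loop."""

    def __init__(self, n):
        self.n = n
        self.arrived = FetchAndAdd(0)  # total arrivals, never reset

    def wait(self):
        ticket = self.arrived.fetch_and_add(1)
        # This arrival belongs to phase ticket // n; the phase is complete
        # once the arrival count reaches the end of that phase.
        phase_end = (ticket // self.n + 1) * self.n
        while self.arrived.load() < phase_end:
            pass  # busy wait, no system call


def run_demo(n=4, phases=3):
    """Each worker logs its phase number, then synchronizes."""
    barrier = SpinBarrier(n)
    log = []

    def worker():
        for phase in range(phases):
            log.append(phase)
            barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because the arrival counter only grows, no reset step is needed between phases. In run_demo no worker can log phase k+1 before every worker has logged phase k, so the returned log is in nondecreasing order.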

In  summary,  we  have  considered  a  spectrum  of  "organizational  styles"  of
parallel  programs. 

(1)  On  one  end  of  the  scale  are  jobs  that  are  static  in  their  use  of  PEs  and  are 
subject  to  real-time  constraints.  Processes  associated  with  such  jobs  should  not 
be  preempted  under  any  circumstances. 

(2)  More  important  is  a  class  of  non-real-time  applications  referred  to  above,  which 
coordinate  parallel  activities  on  a  fixed  group  of  PEs  and  are  characterized  by 
very  frequent  internal  synchronization.  As  described  above,  busy-waiting 
synchronization  is  appropriate  for  such  jobs.  They  may  be  preempted  or  even 
swapped  out,  as  long  as  all  of  the  processes  are  preempted  together. 

(3)  Jobs  displaying  dynamic  parallelism  are  better  suited  to  our  original  model  of 
process  scheduling.  Although  internal  synchronization  will  still  be  needed 
occasionally,  it  can  be  adequately  managed  with  the  hybrid  synchronization 
mechanism  proposed  above,  even  in  the  presence  of  chunking.  Nonetheless 
there  are  reasons  to  keep  related  ("sibling")  processes  executing  concurrently: 
First,  there  are  algorithms  whose  performance  is  improved  when  parallel 
processes  execute  at  more  or  less  the  same  rate,  that  is,  when  the  execution  rate 
of  the  slowest-progressing  process  in  the  spawned  set  is  maximized.  Second, 
the  effectiveness  of  the  processor  cache  will  be  improved  when  successive 
processes  executed  on  the  same  PE  come  from  the  same  job,  since  they  are 
then  likely  to  reuse  cache  entries  for  program  code'. 

The  structure  of  the  central  ready  queue  is  further  complicated  by  this  need  to 
recognize  groups  of  sibling  processes  in  swapping  and  scheduling.  However,  the 
original  fetch-and-add  based  notion  of  bottleneck-free  inserts  and  deletes  can  be 
maintained. 

5.   Kernel  Interface 

The  set  of  operating  system  services  needed  by  parallel  programs  is  to  a  large 
degree  identical  to  that  provided  in  conventional  serial  operating  systems.  We 
enhance  the  unix  kernel  interface  with  a  small  number  of  primitives  for  creation 
and  synchronization  of  parallel  threads  of  control,  and  for  management  of  shared 
and  private  memory  areas.  The  desire  for  high  performance  also  has  ramifications 
for  other  system  functions,  such  as  I/O  and  file  system  organization,  that  are  not 
specifically  related  to  MIMD  or  shared  memory  systems.  Because  we  are  only 
beginning  to  investigate  these  last  areas,  no  further  comments  about  them  will  be 
made  in  this  paper. 


for  creation  of  spawned  children  on  the  ready  queue  at  a  priority  greater  than  that  of  the  parent. 

'Here  we  assume  that  the  cache  architecture  permits  retaining  of  cache  lines  across  a  context 
There  are  many  different  ways  of  structuring  parallel  programs,  and  hence  the
most  important  goal  of  the  system  interface  design  is  generality.  Both  process 
management  and  memory  management  offer  choices  for  programming  language  and 
application  designers  that  involve  complex  tradeoffs  between  ease  of  program 
design  and  debugging,  and  possible  efficiencies  to  be  obtained  through  low-level 
coding  or  lower  levels  of  protection;  between  efficient  accommodation  of  volatile 
levels  of  parallelism  and  efficient  processing  of  I/O  or  tight  internal 
synchronization;  between  turnaround  time  for  a  particular  job  and  overall 
throughput  for  a  set  of  jobs  with  varying  resource  requirements;  and  so  forth.  As 
always  in  operating  system  design,  the  kernel  must  implement  a  small  set  of 
primitives  that  provide  for  the  widest  possible  range  of  user  applications. 

Almost  all  of  the  facilities  described  in  this  section  are  implemented  in  a 
prototype  UNIX-based  operating  system  at  NYU,  which  is  described  later  in  this
paper.  However,  the  kernel  interface  is  very  much  a  work  in  progress.  Far  more 
experience  is  needed  in  designing  languages  and  programs  for  shared-memory 
multiprocessors  before  we  will  fully  understand  the  requirements  for  the
programming  model  in  service  of  parallel  programs. 

5.1.   Process  Management 

At  the  most  basic  level,  each  thread  of  control  in  a  parallel  program  is 
embodied  in  a  standard  unix  process.  We  use  the  term  job  to  refer  to  the  collection 
of  processes  executing  a  single  parallel  program,  normally  a  subtree  rooted  at  a 
process  created  by  fork.  Certain  operating  system  functions  such  as  scheduling, 
shared  memory  management,  and  signaling  may  recognize  jobs  as  well  as  individual 
processes. 

5.1.1.  System  Calls  The  spawn  system  call  has  already  been  introduced.  It  is
a  multi-way  fork  that  creates  n  processes  in  time  essentially  independent  of  n 
(unlike  an  iterated  fork,  which  would  require  time  proportional  to  n).  Spawned 
processes  are  full-blown  unix  processes,  and  they  inherit  attributes  from  the  parent 
much  like  forked  processes,  with  only  minor  exceptions*.  The  set  of  child  processes 
created  by  a  single  spawn  is  referred  to  as  a  spawn  group.

Arguments  to  spawn  include  the  multiplicity  (n),  option  flags,  and,  optionally, 
the  location  of  an  array  used  for  reporting  of  child  processes'  exceptional 
termination  conditions.  Option  flags  include  (1)  request  for  nonpreemptable  child 
processes,  essentially  a  request  for  n  PEs,  and  (2)  request  for  cactus  stack 
processing  (to  be  discussed  in  Section  5.2.4).  The  parent  process  may  obtain  exit 
codes  from  individual  subprocesses,  or  summary  counts  representing  a  histogram  of 
the  various  termination  conditions. 


switch. 

^Spawned  processes  must  however  be  distinguished  from  forked  processes,  for  technical  reasons 
that  will  become  apparent. 
The  spawn  call  normally  returns  inline  in  all  processes,  much  like  fork,
returning  the  child's  "spawn  index"  (a  number  uniquely  chosen  from  {1,...,n})  in
each  child  process  and  zero  in  the  parent  process.  Child  processes  then  execute
independently  until  terminating  with  exit.

Mwait  ("multiple  wait")  is  a  new  system  call  used  by  a  parent  process  to  await 
the  termination  of  all  spawned  children.  Again,  neither  mwait  nor  exit  involves
serial  operation,  as  would  be  the  case  using  the  standard  wait  system  call,  which 
would  have  to  be  iterated.  Mwait  will  also  return  in  the  event  of  an  abnormal  child 
termination,  with  the  error  status  suitably  reported.  Furthermore,  mwait  can  be 
used  to  test  (without  blocking)  whether  outstanding  children  remain. 
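The interface of spawn and mwait can be illustrated with a rough emulation on a conventional POSIX system. This sketch rests on several assumptions: it uses an iterated fork and an iterated wait, so it has exactly the serial cost that the real calls are designed to avoid; the child body is passed as a function instead of returning inline; and the function names and the shape of the returned histogram are illustrative choices, not the prototype's actual interface.

```python
import os
from collections import Counter


def spawn(n, body):
    """Create n children; child i (1..n) runs body(i) and exits with its result.

    Emulation only: this loop takes time proportional to n.
    """
    pids = []
    for index in range(1, n + 1):
        pid = os.fork()
        if pid == 0:  # in the child
            code = 1  # reported if body raises
            try:
                code = int(body(index)) & 0xFF
            finally:
                os._exit(code)  # never return into the parent's code
        pids.append(pid)
    return pids


def mwait(pids):
    """Await all children and summarize their termination conditions."""
    histogram = Counter()
    for pid in pids:  # serial loop again; emulation only
        _, status = os.waitpid(pid, 0)
        if os.WIFEXITED(status):
            histogram[("exit", os.WEXITSTATUS(status))] += 1
        else:
            histogram[("signal", os.WTERMSIG(status))] += 1
    return histogram
```

For example, mwait(spawn(4, lambda i: i % 2)) yields a histogram with two children under ("exit", 0) and two under ("exit", 1), since the spawn indices 1 through 4 give the exit codes 1, 0, 1, 0.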

A  new  signal,  SIGPARENT,  is  automatically  sent  to  all  processes  spawned  by  a 
terminating  parent.  Unlike  the  traditional  situation  with  forked  processes,  there  is 
usually  no  purpose  in  allowing  continued  execution  of  a  spawned  orphan  process. 
Experience  at  NYU  has  demonstrated  the  need  to  assist  the  programmer  in  cleaning 
up  orphans  after  an  abnormal  termination,  especially  during  debugging  of  parallel 
programs. 

5.1.2.  Low-Overhead  Parallel  Threads  UNIX  processes  are  relatively
"heavy"  objects.  Associated  with  each  is  a  unique  memory  management  context 
containing  private  and  shared  memory  areas,  a  number  of  open  files,  and  a 
substantial  number  of  attributes  (userids,  current  working  directory,  signal  actions, 
etc.).  Process  creation  requires  duplicating  much  of  this  state  and  even  context 
switching  can  involve  considerable  overhead,  depending  on  the  memory 
management  architecture  and  other  factors.  In  earlier  sections  we  proposed  to 
reduce  the  impact  of  process  creation  and  destruction  by  pre-spawning  processes, 
and  to  eliminate  the  context  switching  overhead  as  well  by  permitting  non- 
preemptable  processes.  In  this  last  case  the  program  in  effect  obtains  and  then 
schedules  a  fixed  number  of  PEs;  if  multiprogramming  throughput  is  not  an  issue 
then  one  might  assign  all  or  almost  all  of  the  existing  PEs  to  an  individual 
application  in  this  manner. 

A  natural  organization  for  managing  the  "assigned"  PEs  involves  user-level 
threads  of  control  which  are  scheduled  by  user  code  into  execution  under  the  UNIX
processes  which  are  fixed  on  the  individual  PEs.  These  threads  are  known  herein  as 
tasks  to  distinguish  them  from  UNIX  processes.  The  operating  system  has  no 
cognizance  of  these  tasks:  their  creation,  destruction,  synchronization,  and  resource 
assignments  are  accomplished  by  a  layer  of  software  that  is  part  of  the  user 
program,  a  standard  library,  or  the  language  runtime  environment.  The  "weight" 
of  UNIX  processes  is  no  longer  a  concern,  since  the  processes,  once  spawned,  are
entirely  static.  However,  such  jobs  will  be  unable  to  respond  efficiently  to  highly 
varying  demands  for  service  by  the  usermode  tasks;  if  the  user  wishes  to 
dynamically  expand  or  reduce  the  pool  of  available  PEs  then  process  creation  or 
context  switching  overhead  arises. 
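The organization just described might look roughly as follows. This is a sketch under stated assumptions, not the prototype's runtime: Python threads stand in for the statically spawned processes, tasks are plain callables, and the name run_tasks and the use of a library queue as the user-level ready list are illustrative choices.

```python
import queue
import threading


def run_tasks(n_processes, tasks):
    """Run user-level tasks on a fixed pool of worker 'processes'."""
    ready = queue.Queue()  # user-level ready list; invisible to the kernel
    for task in tasks:
        ready.put(task)
    results = []
    results_lock = threading.Lock()

    def process_body(pe):
        # Each worker stays on its PE and picks up tasks until none remain.
        while True:
            try:
                task = ready.get_nowait()
            except queue.Empty:
                return
            value = task()
            with results_lock:
                results.append((pe, value))

    workers = [threading.Thread(target=process_body, args=(pe,))
               for pe in range(n_processes)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

The pool size is fixed when run_tasks is called. Growing or shrinking it during the run would require creating or retiring workers, which is the overhead noted above.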
There  are  further  difficulties  with  this  scheme  of  "user  multitasking"  in  the
UNIX  process  environment.  If  a  task  modifies  any  aspect  of  the  process  state,  by
(for  example)  opening  a  file  or  changing  the  current  working  directory,  and  that
task  is  later  executed  under  a  different  UNIX  process,  its  environment  will  invisibly
change.  File  access  will  fail  or  affect  an  incorrect  file,  the  current  working  directory
will  be  wrong,  etc.  These  problems  appear  to  be  solvable  within  the  current
framework,  although  the  details  are  still  under  investigation.  For  example,  the
kernel  interface  might  be  extended  to  allow  access  to  files  opened  in  other  processes 
within  the  same  job,  and  in  general,  system  calls  (e.g.  chdir)  can  be  intercepted  by  a 
layer  of  software  that  ensures  regular  system  call  semantics  are  maintained  for  each
task. 
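One possible form of such an interception layer is sketched below for the working-directory case only. The details are assumptions, since the paper leaves them open: the Task record and the functions task_chdir and task_resolve are hypothetical names, and the idea shown is simply that the layer keeps a per-task directory in user space and converts relative paths to absolute ones before any real system call is issued.

```python
import posixpath


class Task:
    """User-level task with its own notion of the working directory."""

    def __init__(self, name, cwd="/"):
        self.name = name
        self.cwd = cwd  # kept in user space, not in any process


def task_chdir(task, path):
    """Intercepted chdir: update the task record, not the process state."""
    task.cwd = posixpath.normpath(posixpath.join(task.cwd, path))


def task_resolve(task, path):
    """Make a task-relative path absolute before the real system call."""
    return posixpath.normpath(posixpath.join(task.cwd, path))
```

With this arrangement it no longer matters which process a task runs under, because no task depends on the process-level working directory. After task_chdir(t, "usr/lib") on a task t that starts at "/", task_resolve(t, "x.o") gives "/usr/lib/x.o", while a second task still at "/" resolves the same name to "/x.o".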

From  the  point  of  view  of  the  operating  system  kernel,  we  have  considered  all 
processes  to  be  homogeneous  (the  job  and  spawn  group  defined  above  are  merely 
aggregates  and  have  no  associated  attributes  or  capabilities).  In  other  work  in 
parallel  programming  environments  the  concept  of  lightweight  tasks  implemented  in 
the  kernel  has  proved  popular  (e.g.,  Baron  et  al.  [85]).  Such  tasks  are  scheduled  by
the  operating  system  but  contain  almost  no  private  state;  rather,  they  exist  within 
the  resource  domain  of  a  process  or  job.  Some  of  the  above  difficulties  would  be 
alleviated  if  such  objects  as  opened  files  were  maintained  on  a  job  level  rather  than 
a  process  level.  Other  aspects  of  the  process  state  semantics  would  change;  e.g.  the 
current  working  directory  could  no  longer  be  manipulated  by  an  individual  program 
thread.  The  overall  utility  of  lightweight  tasks  is  not  yet  known.  It  is  not  clear 
whether  they  will  enable  scheduling  of  volatile  parallel  threads  with  minimal  context 
switch  overhead.   Further  investigation  is  required  in  this  area. 

Memory  management  issues  pertaining  to  this  discussion  are  considered  in  the 
next  section. 

5.2.   User  Memory  Structure 

The  operating  system  must  provide  one  or  more  segments  within  a  job  which 
permit  data  to  be  shared  among  processes.  Here  we  consider  shared  memory  for 
storage  of  data  within  a  parallel  program  rather  than  for  general  inter-process 
communication.  Hence  there  is  no  need  for  memory  segments  shared  among
arbitrary  unrelated  processes,  and  there  may  be  no  need  for  dynamic  creation  of 
shared  memory  segments  other  than  in  the  course  of  program  or  process  initiation. 
Since  arbitrary  subsets  of  the  processes  constituting  a  job  may  wish  to  share 
memory,  full  generality  together  with  full  protection  would  require  a  large  number 
of  shared  segments.  In  a  pure  global  shared  memory  environment,  a  simpler,  more 
structured  approach  appears  adequate  and  natural.  Each  shared  data  area  is 
accessible  over  a  subtree  of  processes;  it  is  created  on  behalf  of  the  parent  process 
and  inherited  by  all  spawned  descendants.  Thus  the  number  of  shared  segments 
visible  to  an  executing  process  is  bounded  by  the  spawn  nesting  level.  When 
sharable  local  memory  is  present,  further  mechanisms  for  management  of  shared 
segments  will  be  required. 
The  process  image  strongly  resembles  that  of  traditional  unix  processes.
Logical  memory  segments  for  shared  program  text,  private  data,  and  program  stack 
remain,  though  in  some  cases  transformed.  We  augment  these  with  an  intra-job 
shared  data  segment. 

5.2.1.  Object  Files  Traditionally  the  text  and  data  segments  are  initialized  by 
the  exec  system  call  from  the  text,  data,  and  bss  (uninitialized  data)  segments  of  an
object  (a.out)  file.  In  the  parallel  environment  more  object  file  segments  are
required.  The  shared  data  segment  is  created  from  shared  data  and  shared  bss 
segments  in  the  a.out  file.  Furthermore,  in  architectures  supporting  local  PE 
memory,  we  may  need  the  capability  to  specify  at  compile  time  program 
components  to  be  loaded  into  the  local  memory.  This  gives  rise  to  an  additional
three  object  file  segments  for  local  text,  local  data,  and  local  bss. 

In  our  prototype  operating  system  we  have  adopted  the  Common  Object  File 
Format  (COFF)  from  AT&T  System  V  unix,  although  in  most  other  respects  our
system  is  based  on  Version  7  (and  in  the  future  4.3  BSD)  unix.  COFF  provides  for 
varying  numbers  and  types  of  segments  in  the  a.out  file,  and  permits  the  needed 
flexibility. 

5.2.2.  Shared  Data  The  shared  data  segment  is  created  at  program  initiation 
(by  exec),  although  it  may  be  of  zero  length.  It  is  inherited  by  all  descendant 
processes.  The  segment  may  be  expanded  at  any  time  through  the  new  shbrk
system  call,  which  is  usually  used  via  a  library  parallel  memory  allocator  known  as
shmalloc.  The  sharing  of  this  data  segment  is  managed  much  like  the  traditional 
UNIX  shared  text  segment,  except  that  it  is  read-write  and  exists  only  within  a  job. 

5.2.3.  Cacheability  Control  Because  the  shared  data  segment  includes  read- 
write  variables,  accesses  are  in  general  not  cacheable.  However,  there  are  various 
circumstances  in  which  certain  variables  are  used,  either  temporarily  or 
permanently,  in  a  private  (accessed  by  only  one  process)  or  read-only  manner.  A 
process  may  dynamically  specify  the  cacheability  of  such  variables.  Means  will
also  be  provided  for  flushing  and  invalidating  the  cache  as  necessary.

5.2.4.  Private  Data  A  private  data  segment  is  created  for  each  spawned  or 
forked  process.  Its  size  is  controlled  with  the  standard  brk/sbrk  system  call.  As  in
standard  unix,  private  data  segments  are  isolated  by  memory  mapping  hardware  so 
that  even  in  case  of  user  program  error  there  is  no  possibility  of  a  private  data 
segment  owned  by  one  process  being  modified  by  another  process.  Accesses  to  the 
private  data  segment  are  always  cacheable. 

In  standard  fork  semantics,  data  in  the  private  data  segment  is  copied  into  the
new  private  segment  created  for  a  new  process.  When  applied  to  spawned
processes,  this  policy  dictates  the  following  programming  language  semantics:
Private  variables  replicated  for  a  child  process  (e.g.  an  iterate  of  a  parallel  loop)  are
initialized  to  the  current  value  of  the  corresponding  variable  in  the  parent's  private
space.  It  is  entirely  possible  that  a  programming  language  will  require  different
behavior,  e.g.,  reinitialization  according  to  an  initializer  or  default  value  instead  of 
a  value  propagated  from  the  parent.  The  spawn  system  call  may  thus  provide  an 
option  requesting  reinitialization  instead  of  copying  of  the  private  segment. 

We  now  consider  the  impact  of  user  multitasking  on  the  private  data  segment. 
An  immediate  obstacle  arises.  If  the  private  variables  of  usermode  tasks  are 
allocated  in  the  private  data  segment,  then  each  time  a  task  moves  from  one  process 
to  another  its  private  variables  will  have  to  be  copied,  or  at  least  remapped,  from 
one  private  data  segment  to  another.  In  either  case,  sufficient  overhead  is 
introduced  into  the  usermode  context  switch  to  abrogate  much  of  the  advantage  of 
this  type  of  program  organization.  A  solution  is  to  place  the  task  private  variables 
in  cacheable  areas  of  the  shared  data  segment,  so  that  they  are  accessible  from  any
process,  and  rely  on  the  compiler  and  user  code  to  (1)  isolate  these  private  subareas 
from  improper  access,  and  (2)  issue  the  appropriate  cache  flush  or  invalidate  during 
the  user  level  task  switch.  Through  avoiding  use  of  the  private  data  segment,  most 
of  the  task  switch  overhead  has  been  eliminated.  Experience  will  be  needed  to 
determine  the  dangers  of  private  data  areas  that  are  not  protected  or  isolated  by  the 
mapping  hardware.  Depending  on  the  programming  language  and  compiler,  it  is 
possible  that  debugging  of  parallel  programs  will  be  more  difficult  than  otherwise. 
Similar  issues  regarding  the  private  data  segment  arise  in  other  situations  that 
involve  pre-spawning. 

5.2.5.  Program  Stack  The  stack  segment  involves  some  of  the  same  issues  as 
the  private  data  segment.  In  standard  unix,  the  stack  is  merely  a  second  private 
data  segment,  distinguished  by  the  fact  that  it  expands  and  shrinks  automatically. 
The  simplest  policy  in  the  parallel  environment  is  to  replicate  the  entire  stack 
segment  on  spawn  as  is  done  for  fork.  The  stack  is  thus  entirely  private  and 
cacheable. 

However,  only  the  stack  frames  created  since  the  last  spawn  need  to  be  private. 
Furthermore,  sharing  the  remainder  of  the  stack  will  permit  realization  of  the 
scope-based  paradigm  for  variable  sharability  in  structured  parallel  code,  which  was 
discussed  briefly  in  Section  3.1.  When  parallel  constructs  are  nested,  in  a  block- 
structured  language,  the  automatic  variables  declared  at  each  level  are  allocated  in 
successive  stack  frames.  A  natural  implementation  of  the  scope  rules  is  to  arrange 
that  a  parent's  stack  frame  be  shared  by  its  child  processes,  and,  in  fact,  by  all  those 
processes  constituting  the  subtree  rooted  at  this  parent  process.  The  resulting 
structure  is  known  as  a  saguaro  or  cactus  stack  (Hauck  and  Dent  [68]).  Private 
stack  frames  for  active  processes  are  linked  to  the  parent's  stack  frame.  There  is  no 
serial  overhead  in  creating  or  destroying  these  private  frames  since  concurrent 
memory  requests  can  be  processed  in  parallel.  An  option  flag  in  the  spawn  system 
call  is  used  to  cause  cactus  stack  (sharing  of  existing  stack)  rather  than  private  stack 
(copying  of  entire  stack)  processing'.
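The linkage of a cactus stack can be illustrated with a small model. It shows only the data structure, not the memory mapping that a real implementation would require, and the names Frame and spawn_frames are assumptions introduced for this sketch. Each spawned child receives a private frame linked to its parent's frame, and a variable reference is resolved by walking toward the root, which mirrors the scope rules of a block-structured language.

```python
class Frame:
    """One stack frame; frames above the last spawn are shared via 'parent'."""

    def __init__(self, parent=None, **variables):
        self.parent = parent
        self.vars = dict(variables)

    def _owner(self, name):
        # Walk from the private frame toward the root of the cactus.
        frame = self
        while frame is not None:
            if name in frame.vars:
                return frame
            frame = frame.parent
        raise NameError(name)

    def get(self, name):
        return self._owner(name).vars[name]

    def set(self, name, value):
        self._owner(name).vars[name] = value


def spawn_frames(parent, n):
    """Give each of n children a private frame linked to the parent's frame."""
    return [Frame(parent, index=i) for i in range(1, n + 1)]
```

In this model a child that assigns to a variable declared in the parent's frame updates the single shared copy, while its own index stays private. In a real parallel run such shared updates would of course need fetch-and-add or other coordination.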

When  user  multitasking  or  other  forms  of  pre-spawning  are  involved,  the  stack 
segment  presents  the  same  problems  as  the  private  data  segment.  The  solution, 
analogously,  is  to  allocate  space  for  all  required  program  stacks  in  the  shared  data 
segment.  Usermode  code  then  manipulates  the  stack  pointer  register  in  the  course 
of  scheduling  tasks,  implementing  logically  private  stacks  inside  the  physically 
shared  area.  Using  cacheable  and  noncacheable  areas  as  appropriate,  a  cactus  stack 
can  be  implemented  in  this  manner.  We  may  now  conclude  that  the  only  user 
memory  segment  needed  for  such  programs  is  the  shared  data  segment.  Private 
data  and  stack  segments  need  not  even  be  created.  Again,  there  are  potential 
dangers  resulting  from  the  lack  of  inter-task  storage  protection. 

5.2.6.  Local  Memory  In  architectures  that  support  local  PE  memory  as  well 
as  global  memory,  one  needs  extra  logical  segments  to  provide  local  versions  of  the 
text,  data,  and  stack.  The  IBM  RP3  further  allows  access  by  each  PE  to  the  local 
memory  of  every  other  PE.  This  is  used  for  message  passing  and  restricted  cases  of 
data  sharing. 

The  required  kernel  interface  facilities  for  support  of  local  memory  features  are 
not  yet  well  understood.  Explicit  allocation  of  private  local  memory  segments  may 
be  provided.  Furthermore,  the  program  text  and  possibly  the  stack  may  be 
"cached"  in  local  memory  invisibly  to  the  user.  When  a  local  memory  is 
insufficient  to  service  all  allocation  requests,  it  may  be  possible  to  use  the  global 
memory  as  a  backing  store,  managed  with  virtual  memory  techniques.  Several 
additional  approaches  for  utilizing  local  memory  are  also  under  investigation. 

5.3.   Usermode  Synchronization 

Both  busy-waiting  and  process-switching  synchronization  occur  among  parallel 
processes  or  tasks  within  a  user  program.  Kernel  primitives  are  provided  in  support 
of  both  forms. 

5.3.1.  Busy-waiting  synchronization  The  operating  system  is  for  the  most 
part  not  involved  in  busy-waiting  synchronization  code  in  user  programs.  No 
facilities  beyond  access  to  shared  variables  (and  perhaps  fetch-and-add)  would 
appear  to  be  required  to  implement  such  routines.   However,  two  difficulties  arise. 

(1)  In  programs  with  signal-handling  routines,  it  may  occur  that  a  process  holding  a 
lock  is  interrupted  by  a  signal  whose  handler  requires  the  same  lock.  Since  the 
signal  handler  operates  within  the  same  process,  deadlock  will  result. 

(2)  When  busy-waiting  locks  are  used  by  a  collection  of  preemptable  processes,  it  is 
possible  that  the  process  holding  a  lock  will  be  preempted  and  perhaps  swapped 


'a  new  activation  record  must  be  created  for  each  child,  so  spawn  must  be  given  the  address  of 
a  subroutine  or  code  block  to  be  executed  by  the  children.  In  this  case,  spawn  only  returns  in  the 
parent  (after  termination  of  the  children). 
out  while  the  rest  of  the  processes,  still  in  memory  and  active,  loop
unproductively  for  a  substantial  period  of  time.

A  new  kernel  interface  feature  is  used  to  avoid  these  problems.  The  user 
program  can  request  temporary  suspension  of  signal  delivery  and/or  preemption 
during  busy-wait  critical  sections.  The  latter  is  only  a  hint,  which  the  kernel  may
ignore,  since  it  does  not  affect  correctness.  It  is  unreasonable  to  implement  these 
functions  with  system  calls,  which  would  increase  manyfold  the  expense  of  a  busy- 
wait  synchronization.  Instead,  the  user  program  sets  these  temporary  modes 
through  specially-designated  communication  flag  variables  that  are  allocated  in  user 
memory.  The  kernel  is  informed  of  the  location  of  these  flag  words  with  a  system 
call  (so  as  to  avoid  "magic  addresses"  encoded  into  the  kernel),  and  checks  the  flags 
at  appropriate  times.  Thus  negligible  overhead  is  added  to  the  user's
synchronization  routines.
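The flag-word protocol can be modeled with a toy simulation. Everything in it is an assumption made for illustration: the class names, the two flag fields, and the points at which the model kernel consults the flags are not taken from the prototype. The property it shows is that the user program changes modes with ordinary stores, and the kernel learns the location of the flags through a single registration call.

```python
class SyncFlags:
    """Flag words living in user memory (hypothetical layout)."""

    def __init__(self):
        self.hold_signals = False  # suspend signal delivery
        self.no_preempt = False    # hint: please do not preempt now


class ToyKernel:
    def __init__(self):
        self.flags = None
        self.pending = []    # signals posted but not yet delivered
        self.delivered = []

    def register_flags(self, flags):
        """The one system call: tell the kernel where the flags are."""
        self.flags = flags

    def post_signal(self, sig):
        self.pending.append(sig)
        self.check()

    def check(self):
        """Run at 'appropriate times', e.g. on a clock interrupt."""
        if self.flags is not None and self.flags.hold_signals:
            return  # leave the signals pending
        self.delivered.extend(self.pending)
        self.pending.clear()

    def should_preempt(self, quantum_expired):
        # Only a hint: a real kernel is free to preempt anyway.
        if self.flags is not None and self.flags.no_preempt:
            return False
        return quantum_expired
```

A busy-wait lock routine would set hold_signals (and possibly no_preempt) before acquiring the lock and clear them after releasing it. In this model a signal posted in between stays pending until the next call to check after the flag is cleared.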

5.3.2.  Process-switching  synchronization  Two  new  system  calls,  block  and
unblock,  are  added  for  reliable  suspension  and  reactivation  of  processes  in  process-
switching  usermode  synchronization  routines.  The  kernel  does  not  provide
semaphores  or  other  coordination  facilities  directly,  but  merely  manages  process
status.

The  kernel  maintains  a  per-process  pending  unblock  flag,  which  is  used  to  avoid 
races  and  deadlock  in  the  user  program.  The  block  system  call  atomically  tests  the 
pending  unblock  flag,  and,  if  set,  clears  the  flag  and  returns  immediately;  otherwise, 
it  suspends  the  process  by  entering  a  blocked  state  that  is  distinguished  from  that  set 
by  pause  or  sigpause.  The  user  may  prevent  signals  from  prematurely  awakening 
the  process,  as  in  sigpause.  The  unblock  system  call  atomically  tests  whether  the 
process  is  suspended  by  a  block,  and,  if  so,  makes  it  ready;  otherwise,  it  sets  the 
pending  unblock  flag  for  the  process.  The  block  and  unblock  primitives  are  used  in 
the  obvious  manner  with  the  proviso  that  the  caller  of  block  must  be  prepared  for 
premature  returns.  This  is  because  the  pending  unblock  flag  may  be  left  over  from  a
previous  coordination  operation.  Code  invoking  block  must  loop  so  as  to  re-block  if 
the  awaited  condition  is  still  not  satisfied. 
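The semantics of block and unblock, including the need to re-block, can be sketched with a user-level model. The model makes several assumptions: the class name ProcessStatus is hypothetical, a condition variable stands in for the kernel's atomic test of process status, and signal handling is omitted.

```python
import threading


class ProcessStatus:
    """Per-process status as the kernel would keep it (modeled in user space)."""

    def __init__(self):
        self._cv = threading.Condition()
        self._pending_unblock = False
        self._suspended = False

    def block(self):
        with self._cv:  # test and suspend are atomic under this lock
            if self._pending_unblock:
                self._pending_unblock = False
                return  # possibly a premature return
            self._suspended = True
            while self._suspended:
                self._cv.wait()

    def unblock(self):
        with self._cv:
            if self._suspended:
                self._suspended = False  # make the process ready
                self._cv.notify()
            else:
                self._pending_unblock = True  # remember for the next block


def demo():
    proc = ProcessStatus()
    shared = {"ready": False, "seen": False}
    proc.unblock()  # stale unblock left over from an earlier operation

    def waiter():
        while not shared["ready"]:  # loop so as to re-block if needed
            proc.block()
        shared["seen"] = True

    t = threading.Thread(target=waiter)
    t.start()
    shared["ready"] = True  # establish the awaited condition first
    proc.unblock()          # then wake the waiter
    t.join()
    return shared
```

In demo the first call to block may return at once because of the stale flag. The waiter then finds the condition still unsatisfied and blocks again. Because unblock records a pending flag when the target is not yet suspended, the wakeup cannot be lost whichever way the two threads interleave.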

6.   Building  the  Parallel  Operating  System 

Since  programs  written  for  an  Ultracomputer-like  machine  will  generate 
hundreds  or  thousands  of  concurrent  activities,  we  will  encounter  a  correspondingly 
high  level  of  simultaneous  requests  for  operating  system  services.  Serial  processing 
of  these  requests  will  generate  unacceptable  bottlenecks  on  a  large  machine. 
Therefore,  the  kernel  must  itself  be  a  highly  parallel  bottleneck-free  program. 

We  have  already  outlined  an  implementation  of  processor  scheduling  satisfying 
these  constraints.  The  synchronization  primitives  and  data  structures  described 
below  in  the  context  of  the  operating  system  are  equally  applicable,  with  minor 
implementation  differences,  in  user  applications.  These  algorithms  have  been 
implemented  in  an  experimental  operating  system  running  on  a  prototype
(8-processor)  Ultracomputer.  Based  on  UNIX  Version  7,  the  experimental  system  is
symmetric  (i.e.,  there  is  no  master-slave  relationship)  as  well  as  parallel.  As 
described  earlier,  the  system  incorporates  facilities  for  parallel  applications 
programs.   Work  has  commenced  on  a  4.3  BSD-based  follow-on  system. 

6.1.  Data  Structures 

All  operating  system  functions  make  heavy  use  of  centrally  stored  concurrently 
accessible  data  structures  made  possible  by  the  fact  that  simultaneous  references  to 
the  same  memory  location  can  be  accomplished  in  the  time  required  for  one 
reference.   Here  we  consider  a  few  structures  that  have  proven  useful. 

We  have  avoided  pure  linked  lists,  since  we  know  of  no  bottleneck-free 
algorithm  for  deleting  items  in  a  linked  list.  The  desirable  characteristics  of  linked 
lists  must  be  found  in  other  structures.  As  will  be  seen,  many  of  these  structures
use  linked  lists  as  subcomponents. 

6.1.1.  Queues  Queues  similar  to  those  employed  for  process  scheduling  are 
used  by  synchronization  primitives:  The  set  of  processes  waiting  for  a  lock  or  an 
event  are  held  on  such  a  queue. 

The  queues  used  by  the  operating  system  are  somewhat  different  from  the 
simple  array  implementation  discussed  earlier.  We  obtain  queues  of  unbounded
size  by  associating  a  linked  list,  protected  by  a  semaphore,  with  each  element  of  the
array.  An  insertion  at  array  element  j  appends  its  item  to  the  list  associated  with
element  j.  The  maximum  concurrency  supported  by  this  structure  equals  the 
number  of  lists,  i.e.,  the  size  of  the  array  (see  Rudolph  [82]).  A  variation  of  this 
queue  structure  in  which  FIFO  ordering  is  relaxed  is  also  frequently  used,  e.g.,  to 
manage  a  pool  of  free  items  to  be  allocated. 
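
The unbounded queue described above can be sketched as follows. This is a minimal illustration in Python, not the kernel's code. The class names (FetchAndAddCell, MultiQueue) are hypothetical, fetch-and-add is emulated with a lock because Python offers no such primitive, and the occupancy counter used to detect an empty queue is an assumption of the sketch.

```python
import threading
import time
from collections import deque


class FetchAndAddCell:
    """Software stand-in for a hardware fetch-and-add cell (assumption)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def faa(self, delta):
        """Atomically add delta and return the OLD value."""
        with self._lock:
            old = self._value
            self._value += delta
            return old


class MultiQueue:
    """Array of linked lists, each protected by its own lock (binary semaphore)."""

    def __init__(self, nlists):
        self.nlists = nlists
        self.lists = [deque() for _ in range(nlists)]
        self.locks = [threading.Lock() for _ in range(nlists)]
        self.tail = FetchAndAddCell()   # next insertion ticket
        self.head = FetchAndAddCell()   # next deletion ticket
        self.count = FetchAndAddCell()  # number of completed insertions not yet claimed

    def insert(self, item):
        j = self.tail.faa(1) % self.nlists      # pick array element j
        with self.locks[j]:                     # serialize only on list j
            self.lists[j].append(item)
        self.count.faa(1)                       # publish the item

    def delete(self):
        """Return the next item, or None if the queue is empty."""
        if self.count.faa(-1) <= 0:             # try to claim one item
            self.count.faa(1)                   # nothing to claim: undo
            return None
        i = self.head.faa(1) % self.nlists
        while True:
            with self.locks[i]:
                if self.lists[i]:
                    return self.lists[i].popleft()
            time.sleep(0)                       # item not yet appended: retry
```

Up to nlists insertions or deletions can proceed at once, since two operations contend only when their tickets map to the same list.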

A  similar  structure  results  when  hash  tables  are  used  to  access  indexed 
(dictionary)  information.  The  size  of  the  hash  table  (number  of  buckets)  is  set 
according  to  the  desired  maximum  concurrency,  usually  the  number  of  PEs  times  a 
small  factor.  The  buckets  are  linked  lists,  protected  by  readers-writers  locks,  so 
that  if  no  updates  are  occurring,  items  are  accessed  with  no  serialization.  An 
example  is  given  below. 

6.2.  Memory  Allocation 

The  creation  of  a  new  process  requires  allocation  of  space  for  its  u-block  as 
well  as  its  private  data  and  stack  segments,  if  any.  As  with  process  management, 
we  adopt  a  self-service  mechanism.  A  number  of  parallel  algorithms  for  memory 
management  have  been  designed,  including  (non-demand)  paging,  two  variants  of 
the  Buddy  System  (Knuth  [68]),  and  a  boundary  tag  method  (Knuth  [68]).  All  are 
parallel  analogs  of  serial  algorithms.  All,  except  for  one  of  the  Buddy  System 
variants,  maintain  queues  (whereas  their  serial  analogs  keep  lists)  of  free  memory 
blocks.  Insertions,  deletions,  and  accesses  to  blocks  within  a  concurrently-accessible 
queue  are  at  the  core  of  these  algorithms;  as  a  consequence  we  obtain
critical-section-free  memory  allocation.

6.3.   Coordination  Primitives 

An  ideal  situation  for  parallel  execution  occurs  when  completely  asynchronous 
behavior  is  permitted,  as  in  some  "chaotic"  algorithms.  Unfortunately,  however,  it 
is  sometimes  necessary  to  coordinate  processes'  accesses  to  shared  data  structures. 
In  these  cases  one  must  be  careful  to  permit  as  much  parallelism  as  possible.  Here 
we  discuss  three  mechanisms  for  process  coordination  that  we  have  used 
successfully  in  our  kernel,  namely  counting  semaphores,  readers/writers  locks,  and 
events.  When  designing  such  mechanisms,  one  must  specify  whether  a  processor 
denied  permission  should  (busy)  wait,  or  suspend  execution  of  the  current  process 
and  switch  to  another. 

6.3.1.  Busy-Waiting  Synchronization  Despite  the  potential  for  waste  in 
busy-waiting,  there  are  several  reasons  for  using  it,  including  potentially  low 
overhead  and  applicability  to  situations  where  context  switching  is  inappropriate^10.
We  have  used  busy-waiting  counting  semaphores  and  readers/writers 
synchronization  extensively  and  have  described  algorithms  for  them  before  (see 
Gottlieb  et  al.  [83a]). 

Semaphores  are  often  used  to  serialize  access  to  a  small  partition  of  a  larger 
concurrently  accessible  data  structure;  for  example,  the  individual  linked  lists  used 
to  implement  queues  of  unrestricted  size. 
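
A busy-waiting counting semaphore built on fetch-and-add might look like the sketch below. It follows the general "test, decrement, retest" pattern and is not a transcription of the algorithms in Gottlieb et al. [83a]. The names FetchAndAddCell and BusyWaitSemaphore are hypothetical, and fetch-and-add is again emulated with a lock, which is an assumption forced by the choice of Python.

```python
import threading
import time


class FetchAndAddCell:
    """Software stand-in for a hardware fetch-and-add cell (assumption)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def faa(self, delta):
        """Atomically add delta and return the OLD value."""
        with self._lock:
            old = self._value
            self._value += delta
            return old


class BusyWaitSemaphore:
    """Counting semaphore whose P operation spins instead of blocking."""

    def __init__(self, initial):
        self._s = FetchAndAddCell(initial)

    def value(self):
        return self._s.faa(0)               # fetch-and-add of 0 is a plain read

    def P(self):
        while True:
            if self._s.faa(0) > 0:          # test: does a unit seem available?
                if self._s.faa(-1) > 0:     # decrement, then retest the old value
                    return                  # we obtained a unit
                self._s.faa(1)              # lost the race: undo the decrement
            time.sleep(0)                   # yield; a kernel would simply spin

    def V(self):
        self._s.faa(1)                      # release one unit
```

The initial test keeps waiting processors from repeatedly decrementing and restoring the counter while no unit is available.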

Readers/writers  synchronization  is  used  naturally  in  cases  where  exclusive 
access  is  required  only  infrequently.  We  also  support  upgrading  a  read  lock  to  a 
write  lock  and  downgrading  a  write  lock  to  a  read  lock,  and  have  used  the  resulting 
protocols  to  implement  search  structures  that  must  support  the  operation:  "search 
for  an  item,  and  insert  it  if  not  found".  The  inode  table  in  our  UNIX  kernel,  for
example,  is  implemented  in  this  manner.  Such  a  structure  uses  linked  lists  accessed 
through  a  hash  table,  as  described  in  Section  6.1.1.  By  performing  the  search  with  a 
read  lock,  serialization  is  avoided  in  many  instances  including  the  important  case  in 
which  many  processes  search  for  the  very  same  inode  (e.g.,  the  root  inode).  Only 
if  the  inode  is  not  found  is  it  necessary  to  upgrade  to  a  write  lock.  The  upgrade 
operation  may  fail,  in  which  case  the  process  goes  back  to  search  again  (while  the
one  process  that  succeeds  performs  the  insert). 
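
The "search, and insert if not found" protocol can be sketched as below. Several things here are assumptions rather than details taken from the kernel: the class and method names are hypothetical, the lock is built on a Python condition variable instead of busy-waiting on fetch-and-add, and a failed upgrade is assumed to release the caller's read lock, so that the one successful upgrader is not left waiting for the losers.

```python
import threading


class UpgradableRWLock:
    """Readers/writers lock with a read-to-write upgrade that may fail."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False
        self._upgrading = False

    def acquire_read(self):
        with self._cond:
            while self._writer or self._upgrading:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def try_upgrade(self):
        """Caller holds a read lock. True: now holds the write lock.
        False: another upgrade is in progress and the read lock was dropped."""
        with self._cond:
            if self._upgrading:
                self._readers -= 1
                self._cond.notify_all()
                return False
            self._upgrading = True
            while self._readers > 1:        # wait for the other readers to leave
                self._cond.wait()
            self._readers = 0
            self._upgrading = False
            self._writer = True
            return True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


class SearchInsertTable:
    """Hash table whose buckets are lists guarded by upgradable locks."""

    def __init__(self, nbuckets):
        self.buckets = [[] for _ in range(nbuckets)]
        self.locks = [UpgradableRWLock() for _ in range(nbuckets)]

    def find_or_insert(self, key, make_value):
        b = hash(key) % len(self.buckets)
        lock, bucket = self.locks[b], self.buckets[b]
        while True:
            lock.acquire_read()
            for k, v in bucket:             # concurrent searches do not serialize
                if k == key:
                    lock.release_read()
                    return v
            if lock.try_upgrade():          # only one searcher wins the upgrade
                value = make_value(key)
                bucket.append((key, value))
                lock.release_write()
                return value
            # Upgrade failed: some other process is inserting; search again.
```

Because a searcher holds its read lock from the search until the upgrade attempt, no insertion can slip in between the two, so the winner need not search again before inserting.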

6.3.2.  Non-Busy-Waiting  Synchronization  Process-switching 
synchronization  is  commonly  used  in  multiprogramming  systems  since  it  permits  a 
processor  to  continue  performing  useful  work  when  the  progress  of  the  current 
process  is  logically  blocked.  As  in  other  multiprocessor  UNIX  implementations  (e.g.,
Bach  and  Buroff  [84],  Felton  et  al.  [84]),  we  have  replaced  the  internal  kernel  sleep 
and  wakeup  mechanisms.  For  each  of  the  new  mechanisms  described  in  this  section, 
when  a  process  must  block  the  PE  places  it  on  a  queue  associated  with  the  condition
to  be  satisfied,  and  executes  the  next  process  from  the  ready  queue.  When  the
condition  is  eventually  satisfied,  the  blocked  process  is  moved  from  the  waiting
queue  to  the  ready  queue.  Unlike  other  implementations,  we  avoid  critical  sections
in  many  cases  by  using  fetch-and-add  and  concurrently-accessible  queues.

^10  E.g.,  in  the  implementation  of  non-busy-waiting  mechanisms  themselves.

The  best  examples  of  non-busy-waiting  synchronization  come  from  the  area  of 
I/O  processing,  which  takes  on  special  significance  for  the  Ultracomputer,  because 
large  numbers  of  processes  can  simultaneously  perform  related  I/O  operations.  For 
example,  searching  of  important  file  system  directories  would  be  a  bottleneck  if 
serialized.  Since  a  group  of  processes  reading  such  a  directory  would  likely  all 
attempt  to  read  the  same  disk  block,  serialization  would  be  devastating.

At  a  low  level,  physical  I/O  devices  require  serialization;  this  is  easily  provided 
by  semaphores.  Apparent  parallelism  can  be  achieved  via  in-memory  buffer 
caching.  Once  a  disk  block  is  copied  into  a  memory  buffer,  it  may  be  concurrently
accessed  for  reading;  readers/writers  synchronization  is  appropriate  in  this  situation. 

Process-switching  synchronization  is  also  used  to  implement  events,  which  are 
often  associated  with  external  occurrences,  such  as  the  completion  of  an  I/O 
operation.  At  such  a  time,  the  event  is  signaled,  remaining  in  this  state  until  reset. 
Additional  signals  (before  the  reset)  have  no  effect.  Since  an  event  might  never 
happen  (consider  input  from  a  user's  terminal),  it  must  be  possible  to  terminate  a
process  that  is  blocked  on  such  an  event^11.  We  have  developed  a  method  of
premature  unblocking  for  all  of  the  non-busy-waiting  synchronization  primitives
described  above.  This  requires  the  interior  removal  primitive  previously  mentioned 
in  Section  4.1. 

There  is  a  special  difficulty  in  designing  bottleneck-free  algorithms  for 
readers/writers  and  events,  because  the  most  natural  implementation  would  require 
a  single  process  to  completely  empty  a  queue.  For  example,  when  an  event  is 
signaled,  all  processes  waiting  for  it  must  be  awakened.  One  solution  to  this 
problem  is  to  move  the  wait  queue  as  a  single  object  onto  the  system  ready  queue, 
where  it  will  be  treated  in  much  the  same  way  as  an  item  with  multiplicity.  Our 
current  implementation  lets  newly  awakened  processes  "help  out"  by  waking  up 
other  processes.  This  latter  approach  is  less  complex  than  the  "queue  of  queues" 
method,  but  requires  more  time. 
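
The "help out" approach can be illustrated with the following sketch, in which each awakened waiter wakes up to two others before continuing. The class name HelpOutEvent, the fan-out of two, and the lock-protected list standing in for a concurrently accessible wait queue are all assumptions made for illustration. The sketch also ignores a reset that arrives while a wake-up wave is still in progress.

```python
import threading


class HelpOutEvent:
    """Event whose waiters share the work of waking one another."""

    FANOUT = 2  # waiters each awakened process wakes in turn (assumption)

    def __init__(self):
        self._lock = threading.Lock()
        self._waiters = []          # one private wake-up flag per blocked waiter
        self._signaled = False

    def is_signaled(self):
        with self._lock:
            return self._signaled

    def wait(self):
        with self._lock:
            if self._signaled:      # event already happened: do not block
                return
            me = threading.Event()
            self._waiters.append(me)
        me.wait()                   # block until some other process wakes us
        self._wake_some(self.FANOUT)  # help out before resuming our own work

    def _wake_some(self, n):
        for _ in range(n):
            with self._lock:
                if not self._waiters:
                    return
                w = self._waiters.pop()
            w.set()

    def signal(self):
        with self._lock:
            self._signaled = True   # stays signaled until reset
        self._wake_some(1)          # the signaler wakes only one waiter

    def reset(self):
        with self._lock:
            self._signaled = False
```

The signaler does a constant amount of work, and the number of awake helpers grows with each step, so the queue is emptied without any single process having to walk all of it.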

7.   Conclusion 

The  forthcoming  emergence  of  highly  parallel  machines  will  require  that 
essentially  no  serial  bottlenecks  are  introduced  by  either  hardware  or  software.  The 
fetch-and-add  coordination  primitive  provides  a  simple  and  powerful  means  for 
achieving  this  goal  by  permitting  programmers  to  employ  shared  data  structures 
without  relying  on  critical  sections.  We  believe  that  parallel  operating  systems  can 
be  built  that  perform  the  traditional  functions  of  resource  management,  scheduling, 
and  coordination  using  critical-section-free  algorithms.  The  prototype  NYU
Ultracomputer  operating  system  has  been  constructed  to  demonstrate  the  feasibility
of  this  approach.   Results  to  date  are  highly  encouraging.

^11  Of  course  this  is  actually  done  as  part  of  signal  handling.

One  way  to  facilitate  the  early  use  of  highly  parallel  computers  is  to  furnish  a 
simple  programming  model  that  permits  users  to  write  programs  using  variants  of 
conventional  high-level  procedural  languages.  Further  experience  is  needed  to 
determine  whether  the  operating  system  primitives  that  have  already  been  designed 
will  effectively  support  the  parallel  constructs  needed  for  such  languages.  It  is 
especially  important  that  these  high-level  mechanisms  not  introduce  inefficiencies 
that  would  prevent  their  widespread  utility.  In  particular,  using  these  mechanisms
must  not  significantly  reduce  the  parallelism  obtained.  A  variety  of  approaches  to 
parallel  control  and  scheduling  is  being  studied  for  the  support  of  a  wide  range  of 
parallel  applications. 